{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "d2724018-826d-44d7-a851-b780bd4df3c7",
   "metadata": {},
   "source": [
    "# Movie Recommender System\n",
    "\n",
    "In this example you will create a movie recommender system.\n",
    "\n",
    "The system will extract feature vectors from metadata about films using SentenceTransformers, import those vectors into Milvus, with the metadata. When a user submits information about movies they're interested in, you'll search Milvus for similar films and provide searchers movie info from Redis using the results.\n",
    "\n",
    "## Requirements\n",
    "- Python 3.x.\n",
    "- Docker\n",
    "- A system with at least 32GB of RAM, or a Zilliz cloud account"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65c448ff-b3e3-4890-abfe-eadc62fea901",
   "metadata": {},
   "source": [
    "## Data\n",
    "\n",
    "In this project, you'll use [The Movies Dataset from Kaggle](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset/data). This dataset contains metadata on more than 45k movies.\n",
    "\n",
    "The dataset has several files, but you'll only need **movies_metadata.csv,** the main Movies Metadata file. You can use this notebook as a starting point and modify it to take advantage of the rest of this dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b00c36ff-1c17-4adb-b927-8ce39a58dc7d",
   "metadata": {},
   "source": [
    "# Requirements\n",
    "\n",
    "First, install the Python packages needed for this project."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "02cd7b3a-bca2-4e47-b0f6-78eef8f75f10",
   "metadata": {},
   "outputs": [],
   "source": [
    "! python -m pip install pymilvus redis pandas sentence_transformers kaggle"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c187a23-5867-40a2-ac86-891444bac289",
   "metadata": {},
   "source": [
    "## Download dataset\n",
    "\n",
    "Now you'll download the dataset. You'll use the [Kaggle API](https://github.com/Kaggle/kaggle-api) to retrieve the data. \n",
    "\n",
    "Set your login information below, or download **kaggle.json** to a location where the API will find it. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2be6498e-73d5-4807-9783-59a10b73deab",
   "metadata": {},
   "outputs": [],
   "source": [
    "%env KAGGLE_USERNAME=username\n",
    "%env KAGGLE_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX\n",
    "\n",
    "%env TOKENIZERS_PARALLELISM=true"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7614a1b4-b626-4523-a4d2-4ac334987a03",
   "metadata": {},
   "source": [
    "Download the data and unzip it to the **dataset** directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0288baea-48eb-41b1-a09b-ccbc85a5c3b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "import kaggle\n",
    "\n",
    "kaggle.api.authenticate()\n",
    "kaggle.api.dataset_download_files('rounakbanik/the-movies-dataset', path='dataset', unzip=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0069a1c9-055f-4ae5-bb12-ffe2a73e6bdc",
   "metadata": {},
   "source": [
    "## Milvus Server\n",
    "\n",
    "You're going to create vector embeddings from the movie's descriptions. So, you need a way to store, index, and search on those embeddngs. That's where Milvus comes in.\n",
    "\n",
    "This is a relatively large dataset, at least for a server running on a personal computer. So, you may want to use a Zilliz Cloud instance to store these vectors.\n",
    "\n",
    "But, if you want to stay with a local instance, you can download a docker compose configuration and run that.\n",
    "\n",
    "Here's how to get the compose file, and start the server."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "62053db1-7c8e-4397-9561-ca7133219864",
   "metadata": {},
   "outputs": [],
   "source": [
    "! wget https://github.com/milvus-io/milvus/releases/download/v2.3.0/milvus-standalone-docker-compose.yml -O docker-compose.yml\n",
    "! docker-compose up -d"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02719fc2-cf6d-4d9f-b3a1-0101feb9ca59",
   "metadata": {},
   "source": [
    "But, if you want to use cloud, sign up for an account [here](https://cloud.zilliz.com)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1c4bb191-8c5e-4c30-a895-1430e3f78153",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "movies=pd.read_csv('dataset/movies_metadata.csv',low_memory=False)\n",
    "movies.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67517d76-baaf-4d31-8fee-bd684ca3ff5e",
   "metadata": {},
   "source": [
    "Youhave more than 45k movies, with 24 columns of metadata.\n",
    "\n",
    "List the columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "61545790-58c2-4a4d-9e87-97b6122db302",
   "metadata": {},
   "outputs": [],
   "source": [
    "movies.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4da937d6-5e55-4c4b-8024-1f18b1977036",
   "metadata": {},
   "source": [
    "There's no need to store all these columns in Milvus. Trim them down to the metatdata we want to store with the vectors and remove any items that are missing critical fields."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "133ef535-f3af-480e-9b62-7240341b5c1c",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from math import isnan\n",
    "from pprint import pprint\n",
    "\n",
    "trimmed_movies = movies[[\"title\", \"overview\", \"release_date\", \"genres\"]]\n",
    "trimmed_movies.head(4)\n",
    "\n",
    "\n",
    "unclean_movies_dict = trimmed_movies.to_dict('records')\n",
    "print('{} movies'.format(len(unclean_movies_dict)))\n",
    "movies_dict = []\n",
    "\n",
    "for movie in unclean_movies_dict:\n",
    "    if  movie[\"overview\"] == movie[\"overview\"] and movie[\"release_date\"] == movie[\"release_date\"] and movie[\"genres\"] == movie[\"genres\"] and movie[\"title\"] == movie[\"title\"]:\n",
    "        movies_dict.append(movie)\n",
    "\n",
    "print('{} movies'.format(len(movies_dict)))\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "a7cdd27a-476c-4c94-990f-a412adbd1635",
   "metadata": {},
   "source": [
    "Now, it's time to connect to Milvus so you can start uploading data.\n",
    "\n",
    "Here's the code for connecting to a cloud instance. Replace the URI and TOKEN with the correct values for your instance. \n",
    "\n",
    "You can find them in your Zilliz dashboard:\n",
    "![image.png](cluster_info.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "66082fd8-05e5-435b-a077-dd378d2563ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pymilvus import *\n",
    "\n",
    "milvus_uri=\"XXXXXXX\"\n",
    "token=\"XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX\"\n",
    "connections.connect(\"default\",\n",
    "                        uri=milvus_uri,\n",
    "                        token=token)\n",
    "print(\"Connected!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "68ec91fd-65b0-48b1-86e3-edb279a4be50",
   "metadata": {},
   "source": [
    "So, with the meta data stored in Redis, it's time to calculate the embeddings and add them to Milvus.\n",
    "\n",
    "First, you need a collection to store them in. Create a simple one that stores the title and embeddings for in the **Movies** field, while also allowing dynamic fields. You'll use the dynamic fields for metadata.\n",
    "\n",
    "Then, you'll index the embedding field to make searches more efficent."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "id": "aa7ab317-a6f9-48bf-9c80-1792537c99ab",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collection created.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "alloc_timestamp unimplemented, ignore it\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collection indexed!\n"
     ]
    }
   ],
   "source": [
    "COLLECTION_NAME = 'film_vectors'\n",
    "PARTITION_NAME = 'Movie'\n",
    "\n",
    "# Here's our record schema\n",
    "\"\"\"\n",
    "\"title\": Film title,\n",
    "\"overview\": description,\n",
    "\"release_date\": film release date,\n",
    "\"genres\": film generes,\n",
    "\"embedding\": embedding\n",
    "\"\"\"\n",
    "\n",
    "id = FieldSchema(name='title', dtype=DataType.VARCHAR, max_length=500, is_primary=True)\n",
    "field = FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, dim=384)\n",
    "\n",
    "schema = CollectionSchema(fields=[id, field], description=\"movie recommender: film vectors\", enable_dynamic_field=True)\n",
    "\n",
    "if utility.has_collection(COLLECTION_NAME): # drop the same collection created before\n",
    "    collection = Collection(COLLECTION_NAME)\n",
    "    collection.drop()\n",
    "    \n",
    "collection = Collection(name=COLLECTION_NAME, schema=schema)\n",
    "print(\"Collection created.\")\n",
    "\n",
    "index_params = {\n",
    "    \"index_type\": \"IVF_FLAT\",\n",
    "    \"metric_type\": \"L2\",\n",
    "    \"params\": {\"nlist\": 128},\n",
    "}\n",
    "\n",
    "collection.create_index(field_name=\"embedding\", index_params=index_params)\n",
    "collection.load()\n",
    "\n",
    "print(\"Collection indexed!\")\n",
    " "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "613e6c3c-621b-40fc-a7de-a1f5dde37ba9",
   "metadata": {},
   "source": [
    "Now, you need a function to create the embeddings.\n",
    "\n",
    "The primary artifact for movie information is the overview, but including the genre and release date in complete sentences may help with search accuracy.\n",
    "\n",
    "Create a transformer and call it from a simple function:\n",
    "- extract the id field\n",
    "- creates an embed from the overview, genre and release date\n",
    "- inserts the vector into Milvus.\n",
    "\n",
    "You'll reuse the **build_genres** function below for searching."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "730650c1-57c2-4099-a352-2f67d8e77dc8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sentence_transformers import SentenceTransformer\n",
    "import ast\n",
    "\n",
    "def build_genres(data):\n",
    "    genres = data['genres']\n",
    "    genre_list = \"\"\n",
    "    entries= ast.literal_eval(genres)\n",
    "    genres = \"\"\n",
    "    for entry in entries:\n",
    "        genre_list = genre_list + entry[\"name\"] + \", \"\n",
    "    genres += genre_list\n",
    "    genres = \"\".join(genres.rsplit(\",\", 1))\n",
    "    return genres\n",
    "\n",
    "transformer = SentenceTransformer('all-MiniLM-L6-v2')\n",
    "\n",
    "def embed_movie(data):\n",
    "    embed = \"{} Released on {}. Genres are {}.\".format(data[\"overview\"], data[\"release_date\"], build_genres(data))    \n",
    "    embeddings = transformer.encode(embed)\n",
    "    return embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "447355dd-b82b-4660-b192-f614918901fa",
   "metadata": {},
   "source": [
    "Now, you can create the embeddings. This dataset is too large to send to Milvus in a single insert statement, but sending them one at a time would create unnecessary network traffic and add too much time. So, this code uses batches. You can play with the batch size to suit your individual needs and preferences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c7707e54-1564-4940-983b-97a24b678b0a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Loop counter for batching and showing progress\n",
    "j = 0\n",
    "batch = []\n",
    "\n",
    "for movie_dict in movies_dict:\n",
    "    try:\n",
    "        movie_dict[\"embedding\"] = embed_movie(movie_dict)\n",
    "        batch.append(movie_dict)\n",
    "        j += 1\n",
    "        if j % 5 == 0:\n",
    "            print(\"Embedded {} records\".format(j))\n",
    "            collection.insert(batch)\n",
    "            print(\"Batch insert completed\")\n",
    "            batch=[]\n",
    "    except Exception as e:\n",
    "        print(\"Error inserting record {}\".format(e))\n",
    "        pprint(batch)\n",
    "        break\n",
    "\n",
    "collection.insert(movie_dict)\n",
    "print(\"Final batch completed\")\n",
    "print(\"Finished with {} embeddings\".format(j))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc8c2791-4eaa-4554-9b4b-7b482b3684b3",
   "metadata": {},
   "source": [
    "Now you can search for movies that match viewer criteria. To do this, you need a few more functions.\n",
    "\n",
    "First, you need a transformer to convert the user's search string to an embedding. For this, **embed_search** takes their criteria and passed it to the same transformer you used to populate Milvus.\n",
    "\n",
    "By setting the title and overview fields in the return set, you can simply print the result set for the user.\n",
    "\n",
    "Finally, **search_for_movies** performs the actual vector search, using the other two functions for support."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4abea5ad-fb27-4d67-8ff2-6055381e5d7d",
   "metadata": {},
   "outputs": [],
   "source": [
    "collection.load() # load collection memory before search\n",
    "\n",
    "# Set search parameters\n",
    "topK = 5\n",
    "SEARCH_PARAM = {\n",
    "    \"metric_type\":\"L2\",\n",
    "    \"params\":{\"nprobe\": 20},\n",
    "}\n",
    "\n",
    "\n",
    "def embed_search(search_string):\n",
    "    search_embeddings = transformer.encode(search_string)\n",
    "    return search_embeddings\n",
    "\n",
    "\n",
    "def search_for_movies(search_string):\n",
    "    user_vector = embed_search(search_string)\n",
    "    return collection.search([user_vector],\"embedding\",param=SEARCH_PARAM, limit=topK, expr=None, output_fields=['title', 'overview'])\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "27819413-6181-4975-957a-a204a5ecd915",
   "metadata": {},
   "source": [
    "So, put this search to work!\n",
    "\n",
    "This search is looking for 1990s comedies with Vampires. The first hit is exactly that, but as the vector distance increases you can see that the films move further away from what you're looking for.\n",
    "\n",
    "You can play around with different search criteria."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e2bb00e-d400-4d54-94d6-894002546f3b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pprint import pprint\n",
    "\n",
    "\n",
    "search_string = \"A comedy from the 1990s set in a hospital. The main characters are in their 20s and are trying to stop a vampire.\"\n",
    "results = search_for_movies(search_string)\n",
    "\n",
    "for hits in iter(results):\n",
    "    for hit in hits:\n",
    "        print(hit.entity.get('title'))\n",
    "        print(hit.entity.get('overview'))\n",
    "        print(\"-------------------------------\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f64d43d-9250-472e-aa6f-281ec53bd95d",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
